Applications Involved In Disaster Recovery Failures 2024

Published 2 days ago · 5 min read

Backup applications, and their successes and failures during significant disaster recovery incidents, are the uplifting topic of today's article.

Try as I might, and I have, I could not find anywhere that universally reports which backup application was involved in a serious disaster incident, or that breaks those incidents down into successful and failed disaster recovery attempts.

Not going to lie, that blows my mind. It means that, as service providers, we have to spend hours trawling through forums for anecdotal evidence, trying to select the backup solution that fails the least and hoping the posts are not out of date or skewed in some way.

What I will do, though, is carry out that research and document what I have found below. I will also attempt to put together a basic checklist that could, or should, be used to evaluate how backup applications have performed in disaster recovery incidents.

Backup Solutions To Avoid

After quite a bit of research, I have come to the conclusion that there are currently two backup solutions that should be actively avoided. The first is the Kaseya Spanning Backup solution, which I thought needed its very own article here.

The second is an old favorite of mine, which has since been purchased by Arcserve and seems to have had a big fall from grace since then. It is such a shame, because I associate this product with a bit of nostalgia. I have added some of the many links on the problems people are experiencing with ShadowProtect/StorageCraft below:

How Often Do Disaster Recovery Attempts Fail?

According to multiple sources, disaster recovery attempts fail between 36% and 58% of the time when the rubber meets the road. Let's take the middle ground and call it about 50%. That means that roughly half of the time, when you get that dreaded call from a client and your disaster recovery plan has to be put into action, the recovery effort is going to fail.

Now, that does not necessarily mean complete failure. If you are like me, your head is sometimes back in the 90s, when backup failures were measured like a light switch: on or off, yes or no. Today, backup failures need to be measured in time as well as by a number of other metrics.

If the client needed their data back within 24 hours and the restore took 7 days, it should be classed as a failure, even if the restore itself was perfect and even if 7 days is what the client initially agreed to.
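To make the time-based view of failure concrete, here is a minimal sketch of how a recovery attempt could be scored. The class and field names are my own illustrative assumptions, not part of any backup product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryAttempt:
    """One disaster recovery attempt (all field names are illustrative)."""
    data_restored_intact: bool   # did the restored data pass verification?
    time_to_recover: timedelta   # how long the restore actually took
    required_within: timedelta   # how quickly the client needed the data back


def classify(attempt: RecoveryAttempt) -> str:
    """Return 'failure' or 'success' for a recovery attempt."""
    if not attempt.data_restored_intact:
        return "failure"
    if attempt.time_to_recover > attempt.required_within:
        # A perfect restore that arrives too late still counts as a failure.
        return "failure"
    return "success"


# The example from the text: a perfect restore that took 7 days
# when the client needed the data back within 24 hours.
late = RecoveryAttempt(True, timedelta(days=7), timedelta(hours=24))
print(classify(late))  # prints "failure"
```

The point of keeping the required time separate from the actual time is that the same restore can be a success for one client and a failure for another.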

Now, this article is not really interested in the majority of ways a backup can fail, which nearly always come down to misconfiguration or a lack of training or expertise, among many other causes. We are only really interested in how often named backup applications fail to recover data in a disaster recovery scenario because of a failure in the application itself.

The reason is that this is one of the few areas where we get only one chance to avoid this level of pain before we pass the point where it is out of our hands.

That point is the vendor selection phase just before we hand over our credit card. That is the only point where we have the ability to avoid backup vendors whose products do not deliver and are basically ticking time bombs sitting on a shelf.

While we are only interested in failures caused by the backup application, to make sense of it all we still need to know the other ways backups often fail and separate them into their own groups.

Types Of Backup Failure

Not every failed restore during a disaster recovery scenario can be attributed to the backup application being used. Sometimes the disaster recovery attempt fails because of things like insufficient training or improper use of the backup application.

If I were tasked with coming up with an evaluation sheet that could be used universally to measure the success and failure of a backup application in a disaster recovery incident, these are the areas I would like to see measured and segmented. Published by an independent, impartial body, that information would help service providers select what is often a very expensive and vital component of our stack.

I would also separate the causes into three groups: the backup application is at fault, the service provider is at fault, or a third party is at fault. It would be unfair to pin a lack of training, a misconfiguration or a storage provider failure on the backup application vendor. I am a strong believer in taking personal responsibility for areas under your control, and if a service provider fails to train staff to a level where they can correctly configure a backup solution, then that is on them.

Backup Application Failure

  • Failed update introducing product failure
  • Falsely indicating a successful backup 
  • Inadequate alerting when a failure occurs
  • Known issue ignored
  • Known compatibility issues
  • Poorly designed, not fit for purpose

Managed Service Provider Failure

  • Alerting incorrectly set up
  • Alerting not monitored by anyone
  • Mismatched config to client needs
  • Poorly chosen product for the job it is used for
  • Poor training of staff
  • Misunderstandings
  • Zero Disaster Recovery Testing
  • Ransomware Incidents

Third Party Failure

  • ISP Failure
  • 3rd party storage failure
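
As a rough sketch of how the three groups above could be used in practice, the snippet below tags each failure cause with the party at fault and tallies a list of incidents. The short cause keys and the incident list are hypothetical examples I made up for illustration, not real data.

```python
from collections import Counter
from enum import Enum


class Party(Enum):
    BACKUP_APPLICATION = "backup application"
    SERVICE_PROVIDER = "managed service provider"
    THIRD_PARTY = "third party"


# Causes taken from the lists above; the short keys are illustrative.
CAUSES = {
    "failed_update": Party.BACKUP_APPLICATION,
    "false_success_report": Party.BACKUP_APPLICATION,
    "inadequate_alerting": Party.BACKUP_APPLICATION,
    "known_issue_ignored": Party.BACKUP_APPLICATION,
    "known_compatibility_issue": Party.BACKUP_APPLICATION,
    "poor_design": Party.BACKUP_APPLICATION,
    "alerting_misconfigured": Party.SERVICE_PROVIDER,
    "alerting_unmonitored": Party.SERVICE_PROVIDER,
    "config_mismatch": Party.SERVICE_PROVIDER,
    "wrong_product": Party.SERVICE_PROVIDER,
    "poor_training": Party.SERVICE_PROVIDER,
    "misunderstanding": Party.SERVICE_PROVIDER,
    "no_dr_testing": Party.SERVICE_PROVIDER,
    "ransomware": Party.SERVICE_PROVIDER,
    "isp_failure": Party.THIRD_PARTY,
    "third_party_storage_failure": Party.THIRD_PARTY,
}


def tally(incident_causes):
    """Count incidents per responsible party."""
    return Counter(CAUSES[cause] for cause in incident_causes)


# Hypothetical incident log, not real data.
incidents = ["alerting_unmonitored", "false_success_report",
             "no_dr_testing", "isp_failure"]
for party, count in tally(incidents).items():
    print(f"{party.value}: {count}")
```

A tally like this is the kind of breakdown I would want an independent body to publish per backup product.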

Assigning Blame When Disaster Recovery Fails

Surprisingly, it is not all that easy to assign blame when a failure is caused by a backup application. I am sure most of you hold service level agreements (SLAs) with your clients in which you promise to rectify an issue within a certain time frame.

Picture this: you start getting alerts that the backup is failing, and you miss a round of scheduled backups over a 24-hour period. You do the right thing and inform clients that it is out of your hands, as you have found out the problem is due to a configuration error made by the backup application vendor.

All good? What happens at the tail end of that if your client suddenly has a catastrophic failure and needs your disaster recovery plan implemented? The number one issue is that for the past 24 hours you have not been backing up their information as stated in the SLA, and even though they have been advised, you are still failing to meet your agreed SLA.

Secondly, if the problem at the backup vendor is bad enough, you may be unable to access or recover backups you know you have while the vendor goes about fixing whatever caused the backups to fail in the first place.

It does not matter that these events are completely out of your control. All that matters is that you have a signed contract with clients promising to deliver a level of service that you are currently unable to meet.

It gets even worse if, say, the backup application vendor has caused a problem that went unnoticed for an extended period, and it turns out you have not been backing up for several weeks even though all of the checks were coming up green.

The only thing that can get you out of a situation like this is a good quality cyber insurance policy, because someone is going to have to pay. It is probably not going to be the backup vendor, especially if they are large, because large vendors usually have bulletproof disclaimers. It is also not in the interest of the very fabric of how business operates to open a large business up to company-ending lawsuits every time the service it provides causes downtime for its clients (you).

The courts will limit the potential damage they can face. Put another way, imagine if Microsoft were held accountable every time it caused damage to its customers through outages or security incidents, or because an application it provides was found to have an inherent security flaw. It would have been sued out of existence years ago. Unfortunately for the smaller guys further down the trough, that means we are the ones who end up being held accountable for the losses, often for events out of our control.
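
One hedge against checks that stay green while nothing is being backed up is to look for gaps yourself, using evidence gathered independently of the backup application's own status reports. The sketch below assumes you can collect backup completion times from somewhere else, for example the modification times of backup files on storage. The function name, the 26-hour threshold and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def backup_gaps(backup_times, max_interval, now):
    """Return (start, end) pairs where no backup landed for longer than max_interval.

    backup_times should come from evidence gathered independently of the
    backup application's own status reports.
    """
    if not backup_times:
        raise ValueError("no backups found at all")
    # Appending 'now' also catches backups that stopped and never resumed.
    ordered = sorted(backup_times) + [now]
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_interval
    ]


# Hypothetical timestamps: nightly backups that silently stopped after 3 March.
seen = [
    datetime(2024, 3, 1, 1, 0),
    datetime(2024, 3, 2, 1, 0),
    datetime(2024, 3, 3, 1, 0),
]
now = datetime(2024, 3, 24, 9, 0)

# Allow a little slack over the nightly schedule before calling it a gap.
for start, end in backup_gaps(seen, timedelta(hours=26), now):
    print(f"No backup between {start} and {end}")
```

A check like this would flag the three-week hole regardless of what the dashboard says.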

Blaming Backup App For Failure Is Almost Impossible

In researching this article, I initially thought it would be quite easy to find out which backup application on the market has the highest failure rate and which has the fewest failures over time.

I am specifically talking about cases where the service provider using the backup solution had no possible way of knowing the failure was going to occur, and the fault that caused it lay specifically with the backup application itself.

What I have found is that accountability for the failure always leads back to the service provider. Say an application update caused backups to fail because of a VSS issue on one version of Windows Server with a particular update installed, which just so happened to affect 8 servers you back up. On top of that, the application was reporting the failed backups as successful.

Pretty clear cut that this issue was caused by a problem the backup app vendor introduced, and that they should therefore be held accountable, right? Wrong, for a number of reasons. As the provider, you could have provided a secondary backup, such as a local backup copy made with a different backup application.

You could also have set up an automated restore testing environment that mounts the backup images, along with a range of other safeguards.
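
As a minimal sketch of what automated restore testing could look like, the snippet below compares a restored folder against a reference copy by SHA-256 hash. It assumes the backup image has already been mounted or restored to a folder by whatever tooling your backup product provides, and that you hold a reference copy that has not changed since the backup was taken. The function names and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(reference_dir: Path, restored_dir: Path) -> list:
    """Return relative paths that are missing or differ in the restored copy."""
    problems = []
    for reference_file in sorted(reference_dir.rglob("*")):
        if not reference_file.is_file():
            continue
        relative = reference_file.relative_to(reference_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(reference_file) != sha256_of(restored_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems


# Tiny demonstration with throwaway folders and made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    reference, restored = Path(tmp, "reference"), Path(tmp, "restored")
    reference.mkdir()
    restored.mkdir()
    (reference / "ledger.csv").write_text("id,amount\n1,100\n")
    (restored / "ledger.csv").write_text("id,amount\n1,999\n")
    print(verify_restore(reference, restored))
```

An empty list means every reference file came back byte for byte. Anything else is a restore you want to know about before the client does.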

Ultimately many people mistake the legal system for a process that is in some way fair. People get successfully sued every day over events they had no control over. 

That is why it is very important to have a cyber insurance policy that provides good coverage for events such as backup failure or ransomware incidents, because you cannot possibly foresee every area where you may be held accountable when things go wrong.

I would go one step further and advise that every single one of your clients should have their own cyber insurance policy. With so much risk in the MSP space these days, it is not uncommon to be refused coverage outright. With this strategy, however, you can substantially reduce your policy cost and decrease the chances of being refused coverage, while ultimately helping protect both your and your clients' interests.

Conclusion

Backup application failures that are both catastrophic and go unnoticed because the application falsely reports successful backups are actually quite rare, especially if you research the backup vendor and go with one of the well-known companies.

From what I could find, disaster recovery failures are almost always caused by misconfiguration or negligent conduct by the service provider. Less than 1% of all disaster recovery failures were caused by backup application faults that the service provider was not aware of and, even if they had been, could not have rectified.

In other words, it is a non-issue, and even when it does occur it is such a complex path to navigate that leaning on your cyber insurance policy is wiser than attempting to argue your case on the basis of fairness.

If life were fair, not a single organism could exist. Ultimately, nobody cares about fairness except those who are unfairly treated, and arguing the point is almost always going to give you more pain, because the outcome will almost always be unfair in some way. That is why we have technology insurance. Let the insurer fight the issue while you concentrate on what you are good at, which is running your business.

We have a number of other backup articles specifically related to clients listed below that will provide you with more detailed information on a number of related topics:

https://optimizeddocs.com/blogs/backups/backups-client-index

Our team specializes in strategies for IT service providers and we assist in improving profit margins through standardization and consistent record keeping strategies, so you can be confident that our content is tailored to your needs.

Please feel free to explore our other articles and click on any that interest you. If you have any questions or would like to learn more about how we can help you with your documentation needs, please click the "Get In Touch" button to the left and we will be happy to assist you.

We have a number of other business and consulting related articles listed below that will provide you with more detailed information on a number of related topics:

https://optimizeddocs.com/blogs/consulting/consulting-index-page-01
